CLAIMS 

What is claimed is: 






1. A method of determining if a query document matches one or more
documents in a database, the method comprising:

generating a bit profile of the query document based on the number of
bits required to encode each of a plurality of rows of pixels in the
document; and

comparing the bit profile of the query document against bit profiles
associated with a first plurality of documents from the database to
determine if the query document matches one or more of the first
plurality of documents.
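
As a rough illustration of the bit profile recited in claim 1, the sketch below computes one value per row of a bilevel image. The claim does not fix an encoder, so `row_cost_bits` is a hypothetical stand-in that charges a flat `BITS_PER_RUN` for every run of identical pixels; a real system would presumably read the per-row bit counts from the compressed stream itself. All names here are assumptions made for the example.

```python
from typing import List, Sequence

BITS_PER_RUN = 8  # assumption: flat cost per run; real codecs use variable-length codes


def row_cost_bits(row: Sequence[int]) -> int:
    """Hypothetical stand-in for 'bits required to encode a row': count its runs."""
    if not row:
        return 0
    runs = 1
    for prev, cur in zip(row, row[1:]):
        if cur != prev:
            runs += 1
    return runs * BITS_PER_RUN


def bit_profile(image: Sequence[Sequence[int]]) -> List[int]:
    """One entry per row of pixels: the approximate encoding cost of that row."""
    return [row_cost_bits(row) for row in image]


page = [
    [0, 0, 0, 0, 0, 0],  # blank row: a single run
    [0, 1, 0, 1, 0, 1],  # busy, text-like row: six runs
    [0, 0, 1, 1, 0, 0],  # three runs
]
profile = bit_profile(page)
```

Rows that cross text cost more bits than blank rows, which is what lets the profile act as a compact signature of the page layout.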

2. The method of claim 1 further comprising:

performing spectral analysis on the bit profile of the query document to
determine global statistics of the query document; and

comparing the global statistics of the query document against global
statistics associated with a second plurality of documents from the
database to identify the first plurality of documents.



3. The method of claim 2 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of a dominant line spacing in the query document.



4. The method of claim 2 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of a proportion of the query document that is text.

5. The method of claim 2 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of a location of text in the query document.

6. The method of claim 2 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of text concentration in the document, the estimation of text
concentration indicating a lengthwise measure of a proportion of the
query document that is text.



7. The method of claim 1 further comprising precomputing the bit profiles
associated with the first plurality of documents and storing the
precomputed bit profiles in the database.



8. The method of claim 1 wherein comparing the bit profile of the query
document against bit profiles associated with the first plurality of
documents comprises cross correlating the bit profile of the query
document against the bit profiles associated with the first plurality of
documents from the database.

9. The method of claim 8 wherein cross correlating the bit profile of the
query document against the bit profiles associated with the first plurality
of documents from the database comprises generating respective vector
products of the bit profile of the query document and the bit profiles
associated with the first plurality of documents from the database.

10. The method of claim 9 wherein the query document is determined to
match one or more of the first plurality of documents for which the
respective vector product exceeds a threshold.
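
The cross correlation, vector product and threshold of claims 8 to 10 might be sketched as follows. Mean removal, unit-length normalization, the small range of vertical shifts and the 0.9 threshold are all assumptions added so that scores are comparable between documents; the claims themselves only require a vector product compared against a threshold.

```python
import math
from typing import Dict, List, Sequence


def normalize(profile: Sequence[float]) -> List[float]:
    """Remove the mean and scale to unit length (an assumption, not claimed)."""
    mean = sum(profile) / len(profile)
    centered = [p - mean for p in profile]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered] if norm else centered


def best_vector_product(query: Sequence[float], candidate: Sequence[float],
                        max_shift: int = 2) -> float:
    """Largest dot product of the two profiles over a few vertical shifts."""
    q, c = normalize(query), normalize(candidate)
    best = -1.0
    for shift in range(-max_shift, max_shift + 1):
        total = 0.0
        for i, qv in enumerate(q):
            j = i + shift
            if 0 <= j < len(c):
                total += qv * c[j]
        best = max(best, total)
    return best


def matching_documents(query: Sequence[float],
                       database: Dict[str, Sequence[float]],
                       threshold: float = 0.9) -> List[str]:
    """Identifiers of stored profiles whose vector product exceeds the threshold."""
    return [doc_id for doc_id, stored in database.items()
            if best_vector_product(query, stored) > threshold]


query = [8, 48, 24, 48, 8, 8]
database = {"same": [8, 48, 24, 48, 8, 8], "other": [8, 8, 8, 8, 8, 48]}
matches = matching_documents(query, database)
```

Trying a few shifts is one simple way to tolerate a page that was scanned slightly higher or lower than the stored copy.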



11. A method of determining if a query document matches one or more
documents in a database, the method comprising:

identifying up endpoints and down endpoints in the query document,
the up endpoints representing tops of features in the query
document and the down endpoints representing bottoms of
features in the query document;

generating a set of descriptors for the query document based on locations
of the up endpoints and the down endpoints; and

comparing the set of descriptors for the query document against
respective sets of descriptors associated with the one or more
documents in the database to determine if the query document
matches at least one of the one or more documents.



12. The method of claim 11 wherein generating a set of descriptors for the
query document based on locations of the up endpoints and the down
endpoints comprises:

identifying text lines in the query document based on concentrations of
up endpoints and down endpoints along scanlines of the query
document; and

generating the set of descriptors based on distances between selected up
endpoints and selected down endpoints within the text lines in the
query document.

13. The method of claim 12 wherein identifying text lines in the document
based on concentrations of up endpoints and down endpoints along
scanlines of the document comprises:

determining the number of up endpoints and the number of down
endpoints that lie on each of the scanlines; and

identifying respective pairs of scanlines that have a local maximum
number of up endpoints and a local maximum number of down
endpoints as text lines.
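
One possible reading of claim 13 is sketched below: endpoints are counted per scanline, local maxima are located in each count, and every peak of up endpoints is paired with the first peak of down endpoints at or below it. The `(x, y)` endpoint representation, the tie-breaking rule for local maxima and the pairing rule are assumptions for illustration only.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel position; representation is an assumption


def count_per_scanline(endpoints: List[Point], height: int) -> List[int]:
    """Number of endpoints that lie on each scanline y."""
    counts = [0] * height
    for _x, y in endpoints:
        counts[y] += 1
    return counts


def local_maxima(counts: List[int]) -> List[int]:
    """Scanlines with a nonzero count above the previous line and not below the next."""
    peaks = []
    for y, c in enumerate(counts):
        left = counts[y - 1] if y > 0 else 0
        right = counts[y + 1] if y + 1 < len(counts) else 0
        if c > 0 and c > left and c >= right:
            peaks.append(y)
    return peaks


def text_lines(up: List[Point], down: List[Point], height: int) -> List[Tuple[int, int]]:
    """Pair each up-endpoint peak with the first down-endpoint peak at or below it."""
    up_peaks = local_maxima(count_per_scanline(up, height))
    down_peaks = local_maxima(count_per_scanline(down, height))
    lines = []
    for top in up_peaks:
        below = [b for b in down_peaks if b >= top]
        if below:
            lines.append((top, below[0]))
    return lines


up = [(3, 2), (9, 2), (14, 2), (5, 1), (4, 8), (11, 8)]
down = [(3, 5), (9, 5), (14, 5), (7, 6), (4, 10), (11, 10)]
lines = text_lines(up, down, height=12)
```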

14. The method of claim 11 wherein the query document is in a compressed
form in which respective runs of pixels are encoded in one of a plurality
of encoding modes, and wherein identification of the up endpoints and
down endpoints is unaffected by the encoding mode.

15. A method of determining if a query document matches one or more 
documents in a database, the method comprising: 

generating a bit profile of the query document based on the number of 
bits required to encode each of a plurality of rows of pixels in the 
query document;

comparing the bit profile of the query document against bit profiles
associated with a first plurality of documents from the database to
identify one or more candidate documents;

identifying endpoint features in the query document;

generating a set of descriptors for the query document based on locations
of the endpoint features; and

comparing the set of descriptors for the query document against 
respective sets of descriptors for the one or more candidate 
documents to determine if the query document matches at least 
one of the one or more candidate documents. 



16. The method of claim 15 further comprising:

performing spectral analysis on the bit profile of the query document to
determine global statistics of the query document; and

comparing the global statistics of the query document against global
statistics associated with a second plurality of documents from the
database to identify the first plurality of documents, the first
plurality of documents being a subset of the second plurality of
documents.



17. The method of claim 16 wherein performing spectral analysis on the bit 
profile to determine global statistics comprises generating an estimation 
of at least one of a dominant line spacing in the query document, a
proportion of the query document that is text, a location of text in the 
query document, and a text concentration. 



18. The method of claim 15 wherein comparing the bit profile of the query 
document against bit profiles associated with the first plurality of 
documents comprises cross correlating the bit profile of the query
document against the bit profiles associated with the first plurality of
documents from the database.

19. A method of generating a set of descriptors for identifying a document,
the method comprising:

identifying up endpoints and down endpoints in the document, the up
endpoints representing tops of features in the document and the
down endpoints representing bottoms of features in the document;

identifying text lines in the document based on concentrations of up
endpoints and down endpoints along scanlines of the document;
and

generating a set of descriptors based on distances between selected up
endpoints and selected down endpoints in the concentrations of up
endpoints and down endpoints.

20. The method of claim 19 wherein identifying text lines in the document
based on concentrations of up endpoints and down endpoints along
scanlines of the document comprises:

determining the number of up endpoints and the number of down
endpoints that lie on each of the scanlines; and

identifying respective pairs of scanlines that have a local maximum
number of up endpoints and a local maximum number of down
endpoints as text lines.

21. The method of claim 19 wherein identifying text lines in the document
based on concentrations of up endpoints and down endpoints along
scanlines of the document comprises:

determining a dominant line spacing in the document;

determining the number of up endpoints and the number of down
endpoints that lie on each of the scanlines; and

identifying as text lines respective scanline pairs in which the
constituent scanlines are separated by a distance less than the
dominant line spacing and in which the constituent scanlines
respectively have a local maximum number of up endpoints and a
local maximum number of down endpoints as text lines.

22. The method of claim 21 wherein the dominant line spacing is
determined based on spectral analysis of locations of the endpoints in
the document.

23. The method of claim 19 further comprising generating a respective
endpoint profile for each of the scanlines, the endpoint profile
including a count of up endpoints identified on the scanline and a
count of down endpoints identified on the scanline, and wherein
identifying text lines based on concentrations of up endpoints and down
endpoints along scanlines of the document comprises reducing all but
local maximums of the counts of up endpoints and the counts of down
endpoints in respective endpoint profiles.

24. The method of claim 19 wherein identifying text lines based on 
concentrations of up endpoints and down endpoints along scanlines of 
the document comprises:

generating a count of up endpoints and a count of down endpoints for
each of the scanlines;

identifying a first scanline within a locality of scanlines that has the
highest count of up endpoints;

reducing the count of up endpoints associated with each scanline within
the locality of scanlines except the first scanline;

identifying a second scanline within the locality of scanlines that has the
highest count of down endpoints; and

reducing the count of down endpoints associated with each scanline
within the locality of scanlines except the second scanline.

25. The method of claim 24 wherein identifying the first scanline within
the locality of scanlines that has the highest count of up endpoints
comprises:

determining a dominant line spacing of the document; and

defining the locality of scanlines to be scanlines within a range greater
than the dominant line spacing but less than twice the dominant
line spacing.
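
The reduction of all but local maximums described in claims 23 to 25 could look roughly like the sketch below. It rests on several assumptions that the claims leave open: localities are taken as consecutive, non-overlapping windows, "reducing" is taken to mean setting a count to zero, and the window length is `factor` times the dominant line spacing with `factor` strictly between 1 and 2.

```python
from typing import List


def suppress_non_maxima(counts: List[int], line_spacing: int,
                        factor: float = 1.5) -> List[int]:
    """Within each locality keep the highest count and reduce the others to zero."""
    if not 1.0 < factor < 2.0:
        raise ValueError("locality must span between one and two line spacings")
    window = max(1, int(line_spacing * factor))
    result = [0] * len(counts)
    for start in range(0, len(counts), window):
        locality = counts[start:start + window]
        # Index of the highest count in this locality (first one wins on ties).
        best = max(range(len(locality)), key=locality.__getitem__)
        if locality[best] > 0:
            result[start + best] = locality[best]
    return result


up_counts = [0, 1, 5, 2, 0, 0, 0, 3, 6, 1, 0, 0]
reduced = suppress_non_maxima(up_counts, line_spacing=4)
```

The same function would be applied separately to the counts of up endpoints and the counts of down endpoints, since claim 24 identifies a first and a second scanline independently.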



26. The method of claim 19 wherein generating a set of descriptors based on
distances between selected up endpoints and selected down endpoints
comprises defining an ascender zone and a descender zone for each of 
the text lines, the selected up endpoints being up endpoints in the 
ascender zone and the selected down endpoints being down endpoints 
in the descender zone. 

27. The method of claim 26 wherein defining an ascender zone and a
descender zone for each of the text lines comprises:

defining a region above an x-height line of a first text line of the text
lines to be the ascender zone for the first text line; and

defining a region below the baseline of the first text line to be the
descender zone for the first text line.



28. The method of claim 27 wherein the ascender zone of the first text line
is bounded in part by the descender zone for the preceding text line.

29. The method of claim 19 wherein generating a set of descriptors based on
distances between selected up endpoints and selected down endpoints
comprises generating for a first text line of the text lines a first descriptor
that includes a plurality of distance measurements, each distance
measurement indicating a distance between a reference point and a
respective endpoint of the selected up endpoints and the selected down
endpoints.



30. The method of claim 29 wherein the reference point is one of the
selected up endpoints and the selected down endpoints.



31. The method of claim 29 wherein each distance measurement indicating
the distance between the reference point and the respective endpoint is
a relative distance to another endpoint of the selected up endpoints and
the selected down endpoints.
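
A per-line descriptor in the spirit of claims 29 to 31 might be built as in the sketch below. The choice of the leftmost selected endpoint as the reference point, the `(x, kind)` endpoint representation, and the reading of "relative distance" as the distance to the previous endpoint are all assumptions; the claims admit other interpretations.

```python
from typing import List, Tuple

Endpoint = Tuple[int, str]  # (x position, "up" or "down"); an assumed representation


def line_descriptor(selected: List[Endpoint]) -> List[Tuple[str, int]]:
    """Distances from the leftmost selected endpoint (the reference) to the others."""
    ordered = sorted(selected)
    if not ordered:
        return []
    ref_x = ordered[0][0]
    return [(kind, x - ref_x) for x, kind in ordered[1:]]


def relative_descriptor(selected: List[Endpoint]) -> List[Tuple[str, int]]:
    """Variant in which each distance is measured from the previous endpoint."""
    ordered = sorted(selected)
    return [(kind, x - prev_x)
            for (prev_x, _prev_kind), (x, kind) in zip(ordered, ordered[1:])]


selected = [(40, "up"), (12, "up"), (25, "down")]
absolute = line_descriptor(selected)
relative = relative_descriptor(selected)
```

Because both variants depend only on differences between positions, shifting the whole line horizontally leaves the descriptor unchanged.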



32. The method of claim 19 wherein the document is in a compressed form
in which respective runs of pixels are encoded in one of a plurality of
encoding modes, and wherein identification of the up endpoints and
down endpoints is unaffected by the encoding mode.

33. The method of claim 19 wherein the document has been compressed
using Group 4 compression.

34. A method of generating information that can be used to identify a
document, the method comprising:

generating a bit profile based on the number of bits required to encode
each of a plurality of rows of pixels in the document; and

performing spectral analysis on the bit profile to determine global
statistics of the document.



35. The method of claim 34 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of a dominant line spacing in the document.



36. The method of claim 35 wherein generating an estimation of a
dominant line spacing comprises generating a power spectrum density
from the bit profile and calculating the estimation of the dominant line
spacing from a peak value in the power spectrum density.
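
The estimation in claim 36 can be illustrated with a plain discrete Fourier transform. The sketch below uses the squared DFT magnitude of the mean-removed bit profile (a periodogram) as the power spectrum density, which is an assumption; the claim does not say how the density is estimated. The function names and the synthetic profile are likewise illustrative.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(profile: Sequence[float]) -> List[float]:
    """Squared DFT magnitudes for frequencies 0..N/2 of the mean-removed profile."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [p - mean for p in profile]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dominant_line_spacing(profile: Sequence[float]) -> float:
    """Period, in rows, of the strongest nonzero frequency in the bit profile."""
    spectrum = power_spectrum(profile)
    peak = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return len(profile) / peak


# Synthetic profile: 64 rows with one text line every 8 rows.
profile = [40 + 30 * math.cos(2 * math.pi * r / 8) for r in range(64)]
spacing = dominant_line_spacing(profile)
```

The direct DFT is quadratic in the number of rows; an FFT would be the usual choice for full-page profiles.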



37. The method of claim 34 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of a proportion of the document that is text.

the proportion of the document based on an energy under a peak value
in the power spectrum density.



39. The method of claim 34 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of a location of text in the document.

40. The method of claim 39 wherein generating an estimation of a location
of text in the document comprises:

applying a bandpass filter to the bit profile to generate a text energy
profile; and

determining a centroid of the text energy profile to be the estimation of
the location of text in the document.

41. The method of claim 40 wherein applying a bandpass filter to the bit
profile comprises:

determining a dominant line spacing frequency of the document; and 

selecting a center frequency of the bandpass filter based on the dominant 
line spacing frequency.

42. The method of claim 34 wherein performing spectral analysis on the bit
profile to determine global statistics comprises generating an estimation
of text concentration in the document, the estimation of text
concentration indicating a lengthwise measure of a proportion of the
document that is text.

43. The method of claim 42 wherein generating an estimation of text
concentration in the document comprises:

applying a bandpass filter to the bit profile to generate a text energy
profile; and

determining the estimation of the text concentration based on a length
of the text energy profile.
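
The text energy profile used in claims 40 to 43 might be approximated as in the sketch below. The bandpass filter is stood in for by correlating a sliding window with a complex exponential at the line-spacing frequency, which is one crude filter among many. The window length, the 25% cut-off used to measure the "length" of the profile, and all names are assumptions.

```python
import cmath
import math
from typing import List, Sequence


def text_energy_profile(profile: Sequence[float], line_spacing: float) -> List[float]:
    """Energy near the line-spacing frequency in a window around each row."""
    half = max(1, int(line_spacing))  # window of about two text lines (assumption)
    freq = 1.0 / line_spacing         # center frequency, in cycles per row
    n = len(profile)
    energy = []
    for r in range(n):
        lo, hi = max(0, r - half), min(n, r + half)
        window = profile[lo:hi]
        mean = sum(window) / len(window)
        coeff = sum((v - mean) * cmath.exp(-2j * math.pi * freq * (lo + i))
                    for i, v in enumerate(window))
        energy.append(abs(coeff) ** 2)
    return energy


def text_location(energy: Sequence[float]) -> float:
    """Centroid (energy-weighted mean row) of the text energy profile."""
    total = sum(energy)
    return sum(r * e for r, e in enumerate(energy)) / total if total else 0.0


def text_concentration(energy: Sequence[float], fraction: float = 0.25) -> float:
    """Share of rows whose energy exceeds `fraction` of the peak energy."""
    peak = max(energy, default=0.0)
    if peak <= 0:
        return 0.0
    return sum(1 for e in energy if e > fraction * peak) / len(energy)


# Synthetic page: 32 blank rows, then 32 rows of text with a line every 8 rows.
profile = [8.0] * 32 + [40 + 30 * math.cos(2 * math.pi * r / 8) for r in range(32)]
energy = text_energy_profile(profile, line_spacing=8)
location = text_location(energy)
concentration = text_concentration(energy)
```

On this synthetic page the centroid falls in the lower half, where the text is, and the concentration reflects that only part of the page length carries text.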
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1 44. An article of manufacture including one or more computer-readable 

media that embody a program of instructions to configure a processing
system to determine if a query document matches one or more
documents in a database, wherein the program of instructions, when
executed by one or more processors in the processing system, causes the
one or more processors to:

generate a bit profile of the query document based on the number of bits
required to encode each of a plurality of rows of pixels in the
document; and

compare the bit profile of the query document against bit profiles
associated with a first plurality of documents from the database to
determine if the query document matches one or more of the first
plurality of documents.



45. The article of claim 44 wherein the one or more computer-readable
media include one or more non-volatile storage devices.

46. The article of claim 44 wherein the one or more computer-readable
media include a propagated data signal.

47. An article of manufacture including one or more computer-readable 
media that embody a program of instructions to configure a processing 
system to determine if a query document matches one or more
documents in a database, wherein the program of instructions, when
executed by one or more processors in the processing system, causes the
one or more processors to:

identify up endpoints and down endpoints in the query document, the
up endpoints representing tops of features in the query document
and the down endpoints representing bottoms of features in the
query document;

generate a set of descriptors for the query document based on locations
of the up endpoints and the down endpoints; and

compare the set of descriptors for the query document against respective
sets of descriptors associated with the one or more documents in 
the database to determine if the query document matches at least 
one of the one or more documents. 

48. An article of manufacture including one or more computer-readable 
media that embody a program of instructions to configure a processing 
system to determine if a query document matches one or more 
documents in a database, wherein the program of instructions, when 
executed by one or more processors in the processing system, causes the 
one or more processors to:

generate a bit profile of the query document based on the number of bits
required to encode each of a plurality of rows of pixels in the query
document;

compare the bit profile of the query document against bit profiles
associated with a first plurality of documents from the database to
identify one or more candidate documents;

identify endpoint features in the query document;

generate a set of descriptors for the query document based on locations
of the endpoint features; and

compare the set of descriptors for the query document against respective
sets of descriptors for the one or more candidate documents to
determine if the query document matches at least one of the one or
more candidate documents.



49. A data processing system comprising:

a database of document images; and

a computer that includes a processing unit and a memory, the memory
having stored therein a program of instructions to configure the
computer to determine if a query document matches one or more
documents in the database, wherein the program of instructions,
when executed by the processing unit of the computer, causes the
computer to:

generate a bit profile of the query document based on the number
of bits required to encode each of a plurality of rows of pixels
in the document; and

compare the bit profile of the query document against bit profiles
associated with a first plurality of documents from the
database to determine if the query document matches one or
more of the first plurality of documents.

50. A data processing system comprising:

a database of document images; and

a computer that includes a processing unit and a memory, the memory
having stored therein a program of instructions to configure the
computer to determine if a query document matches one or more
documents in the database, wherein the program of instructions,
when executed by the processing unit of the computer, causes the
computer to:

identify up endpoints and down endpoints in the query document,
the up endpoints representing tops of features in the query
document and the down endpoints representing bottoms of
features in the query document;

generate a set of descriptors for the query document based on
locations of the up endpoints and the down endpoints; and

compare the set of descriptors for the query document against
respective sets of descriptors associated with the one or more
documents in the database to determine if the query
document matches at least one of the one or more documents.



51. A data processing system comprising:

a database of document images; and

a computer that includes a processing unit and a memory, the memory
having stored therein a program of instructions to configure the
computer to determine if a query document matches one or more
documents in the database, wherein the program of instructions,
when executed by the processing unit of the computer, causes the
computer to:

generate a bit profile of the query document based on the number
of bits required to encode each of a plurality of rows of pixels
in the query document;

compare the bit profile of the query document against bit profiles
associated with a first plurality of documents from the
database to identify one or more candidate documents;

identify endpoint features in the query document;

generate a set of descriptors for the query document based on
locations of the endpoint features; and

compare the set of descriptors for the query document against
respective sets of descriptors for the one or more candidate
documents to determine if the query document matches at
least one of the one or more candidate documents.